-
Notifications
You must be signed in to change notification settings - Fork 9.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to define http/s health checks for Network Loadbalancer #2708
Comments
Hi, Same issue here with the path. Looking the Aws doc http://docs.aws.amazon.com/fr_fr/elasticloadbalancing/latest/APIReference/API_CreateTargetGroup.html HTTP health check param Might be related to a6d1266 |
I just found this error myself. Terraform 0.11.1
Example code
|
I ran into this issue as well. My theory is that it is checking the rules based on the target group protocol (which in this case is TCP) when it should be checking against the health check protocol (which is HTTPS). Both path and matcher are not applicable to TCP health checks, but they are to HTTP and HTTPS health checks, even if the target group's protocol is TCP. I might also mention that even version 1.5 had some wonky behavior regarding health checks on network load balancers / TCP target groups. This is because network load balancers have certain restrictions on the health checks that application load balancers do not: The unhealthy threshold must equal the healthy threshold, timeout is fixed to 10 and cannot be changed, interval has only two possible values (10 and 30), and matcher (if applicable) is fixed to 200-399. Unfortunately, terraform would still try to manage these parameters even if they weren't supplied. This led to errors in some cases. In others, it would work upon creation, but then a subsequent apply would detect changes to those parameters and attempt to fix them, resulting in errors. This made it necessary to specify all the parameters and make sure they were valid for network load balancers. For instance, In any case, the rules will have to take into account not only the protocol of the target group (which determines whether it's an alb or nlb) but also the protocol of the health check. |
Hi all! Sorry for the regression here. It seems that this is caused by the additional validation checks added in #2380. The goal of these changes was to catch more errors at plan time that were previously only caught at apply time, regarding the various subtle differences between application and network load balancers. @deftflux is correct that the validation code is checking the target group protocol to recognize if a given target group is an application or network target group, but indeed it does seem like the health check protocol is the correct thing to check for this case to match, per the relevant API documentation which describes this particular property ( It seems that the same bug exists for |
After playing with this some more it seems like the current validation is correct here, per what's enforced by the underlying API. After weakening the check in the provider, I see the following error from the remote API during apply:
The logic I'd implemented -- based on the documentation -- was to allow custom health check matchers if the healthcheck protocol is HTTP, but it seems that there is an undocumented additional restriction that However, I see that you all saw something working prior to this validation being added, so I'm now trying to figure out what the old implementation (prior to 1.6) was actually doing in this scenario that was allowing it to work. |
As far as I can tell, this was only working before because it was totally ignoring these attributes: So while indeed this wasn't an error before, it seems like it was never actually working. In principle we could restore the previous behavior of just silently ignoring these arguments for TCP target groups, but that seems counter to Terraform's usually goal of doing what it says it will do or failing loudly if it can't. Given that these attributes were not functional before anyway, I'd like to propose that we move forward with these additional checks in place (arguably it was a bug that these checks were not present before) and require removing these previously-non-functional attributes from configuration when upgrading to 1.6 and above. Of course we ideally would've noticed this change in behavior and included it in the 1.6 changelog, which we can do now retroactively although it won't be visible within the Please let me know if any of you have a use-case where including these arguments even though they are ignored is important; we can then think about how we might strike a compromise to retain the now-more-correct validation while still making those use-cases work. Sorry for the accidental undocumented compatibility break here! 😖 |
Thanks for looking into this @apparentlymart ! I did a little testing, and there is still a problem with 1.6. Consider this test configuration: resource "aws_lb_target_group" "foo" {
name = "tf-nlb-health-check-test"
protocol = "TCP"
port = "1234"
vpc_id = "${local.vpc}"
health_check {
protocol = "HTTPS"
port = 12345
#path = "/custom/path"
#matcher = "200-399"
interval = 30
#timeout = 10
healthy_threshold = 3
unhealthy_threshold = 3
}
} The lines that are commented out are the ones that I used in 1.5 but that are considered invalid now in 1.6. In 1.5 with those lines uncommented, it works for both creation and subsequent With those lines commented out, 1.6 will create the target group successfully. However, subsequent
Apparently, terraform is detecting a change in the matcher, presumably because normally the default matcher is assumed by terraform to be So it looks like these attributes are being correctly ignored when creating a TCP target group, but not when managing an existing TCP target group. As far as requiring the removal of these attributes, I suppose it is a bit of a bug that we had to explicitly specify them previously. But we could still maintain backwards compatibility if we allow the only valid value to be specified as the commented lines above. I would definitely be in favor of at least fixing it so that we do not have to specify those attributes, however. |
I'm currently experiencing the same problem. I created an
When i downgrade to
How is an upgrade suppose to happen with the current implementation? |
You are right, @arminbuerkle. Currently in 1.6, there is no way to have an HTTP/S health check for a TCP target group without getting errors on subsequent |
Thanks for that extra detail @deftflux, and sorry for the silence while I was on my holiday break. Indeed it does seem like there is an issue here based on what you described. I suppose what's happening here is that we're reading back some server-provided defaults from the API that are then causing validation to fail. Probably the best solution for that would be to add some extra checks to the |
I guess I may have gotten a bit lost in the discussion here, but I'm hitting this issue as well. Via both |
Hi @jantman yes, you are correct, you can create TCP Target Groups with HTTP health checks, and Terraform should be letting us too, but isn't right now. What you can't do it create a TCP Target Group with and TCP health check, and then change the health check to HTTP (likewise for any protocol change even HTTP -> HTTPS). Changing the health check protocol requires destroying and recreating the Target Group. Likewise changing unhealthy threshold, timeout, interval, and success codes (matcher), can be set on creation, but all require the Target Group to be recreated to change them (which seems harsh!). Another note about 'interval', I note in the UI that for HTTP/HTTPS health checks you can freely set the interval, but for TCP health check you can only set 10 seconds or 30 seconds. That might be an API restriction too. Though I guess Terraform could leave that validation up to AWS. |
This has been released in terraform-provider-aws version 1.7.0. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. |
PR #2906 lets you create HTTP/HTTPS health checks now 🎉, but it doesn't fully resolve validation. The protocol can be changed between HTTP and HTTPS, but changing to or from TCP should trigger a recreation plan for the Target Group.
|
@whereisaaron I just ran into issue switch from TCP to HTTP/HTTPs (it needs to be re-created and terraform doesn't know that); was a separate issue ever opened to address that? |
@iancward no sorry, I don't know any issue for this bug where recreation isn't triggered. I currently have to manually intervene. |
@whereisaaron it doesnt let me change to a custom healthcheck with TCP protocol. While it lets me in AWS console. So i dont think this is something rejected by AWS api's |
The server currently 404s the root path. From a master: [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/ HTTP/1.1 404 Not Found Content-Type: text/plain; charset=utf-8 X-Content-Type-Options: nosniff Date: Sun, 16 Dec 2018 06:30:14 GMT Content-Length: 19 404 page not found but we need a reliable response on the range 200-399 to satisfy our network load balancer health checks, which do not support configurable response status codes [1,2] (these are Terraform links, but they discuss an AWS restriction that is not Terraform-specific). This commit adds a /healthz endpoint which always 204s (when the server is alive to handle it). [1]: hashicorp/terraform-provider-aws#2708 (comment) [2]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
As suggested by Dani Comnea [1]. When we switched to network load balancers in 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), we replaced things like: resource "aws_elb" "api_internal" { ... health_check { healthy_threshold = 2 unhealthy_threshold = 2 timeout = 3 target = "SSL:6443" interval = 5 } ... } with: resource "aws_lb_target_group" "api_internal" { ... health_check { healthy_threshold = 3 unhealthy_threshold = 3 interval = 10 port = 6443 protocol = "TCP" } } This resulted in logs like [2]: [core@ip-10-0-11-88 ~]$ sudo crictl ps CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT 1bf4870ea6eea registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56 2 minutes ago Running machine-config-server 0 [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea I1215 20:23:07.088210 1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty I1215 20:23:07.088554 1 api.go:54] launching server I1215 20:23:07.088571 1 api.go:54] launching server 2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF 2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF 2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF ... when the health check opens a TCP connection (in this case to the machine-config server on 49500) and then hangs up without completing the TLS handshake. Network load balancers [3,4] do not have an analog to the classic load balancers' SSL protocol [5,6,7], so we're using HTTPS. There's some discussion in [8] about the best way to perform unauthenticated liveness checks on the Kubernetes API server that suggests 401s are possible in some configurations. Checking against a recent cluster: $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz curl: (60) Peer's Certificate issuer is not recognized. More details here: http://curl.haxx.se/docs/sslcerts.html curl performs SSL certificate verification by default, using a "bundle" of Certificate Authority (CA) public keys (CA certs). If the default bundle file isn't adequate, you can specify an alternate file using the --cacert option. If this HTTPS server uses a certificate signed by a CA represented in the bundle, the certificate verification probably failed due to a problem with the certificate (it might be expired, or the name might not match the domain name in the URL). If you'd like to turn off curl's verification of the certificate, use the -k (or --insecure) option. $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz HTTP/1.1 200 OK Cache-Control: no-store Date: Sun, 16 Dec 2018 06:18:23 GMT Content-Length: 2 Content-Type: text/plain; charset=utf-8 ok I don't know if the network load balancer health checks care about certificate validity or not. I guess we'll see how CI testing handles this. Ignition is only exposed inside the cluster, and checking that from a master node: [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/ curl: (60) Peer's Certificate issuer is not recognized. More details here: http://curl.haxx.se/docs/sslcerts.html curl performs SSL certificate verification by default, using a "bundle" of Certificate Authority (CA) public keys (CA certs). If the default bundle file isn't adequate, you can specify an alternate file using the --cacert option. If this HTTPS server uses a certificate signed by a CA represented in the bundle, the certificate verification probably failed due to a problem with the certificate (it might be expired, or the name might not match the domain name in the URL). If you'd like to turn off curl's verification of the certificate, use the -k (or --insecure) option. [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/ HTTP/1.1 404 Not Found Content-Type: text/plain; charset=utf-8 X-Content-Type-Options: nosniff Date: Sun, 16 Dec 2018 06:30:14 GMT Content-Length: 19 404 page not found So we're checking the new /healthz from openshift/machine-config-operator@d0a7ae21 (server: Add /healthz, 2019-01-04, openshift/machine-config-operator#267) instead. Unfortunately, setting matcher [9] is not allowed for network load balancers (e.g. see [10,11]). Setting it leads to errors like: ERROR * module.vpc.aws_lb_target_group.api_internal: 1 error occurred: ERROR * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol ERROR status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c So I've left it unset here, and we'll just hope the 401s don't start happening. [1]: openshift#923 [2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ [3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html [4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol [5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html [6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target [7]: hashicorp/terraform-provider-aws#6866 [8]: kubernetes/kubernetes#43784 [9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher [10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612 [11]: hashicorp/terraform-provider-aws#2708 (comment)
Unfortunately problem persist for Terraform v0.11.11
|
@MichalPloski yes, this looks to still be validated along with other similar parameters: https://github.com/terraform-providers/terraform-provider-aws/blob/d0edc835f07ef347937892b691b9ab0a602b2372/aws/resource_aws_lb_target_group.go#L673 That check technically allows 0 to be set in the case of TCP health check but unfortunately, timeout param fails validation against the allowed range 2-60. With
|
Why is this closed? seems to still not be working? Or maybe there is a workaround? |
@red8888 It is still an issue for me because I would like to re-use the |
Is there workaround for this? Why this was closed? |
I am using provider.aws: version = "~> 2.23", and I am still facing this issue, Let me know how the issue can be resolved or any workaround. |
Had the same issue. TF version 0.11.4. Provider 2.27.0. Solved it by setting matcher to 200~399. Anything else would fail with the same error the rest are having. |
I have made the changes which you have suggested, but still facing the same issue. below is NLB target group resource , resource "aws_lb_target_group" "nlb_target" { health_check { Let me know if I need to config any additional params? |
I have fixed the issue, by setting below values, |
as per https://github.com/terraform-providers/terraform-provider-aws/issues/2708\#issuecomment-353041848 and try to use HTTP as protocol to support a health check path
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks! |
Hi there,
Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.
Terraform Version
Terraform 0.11.1 with AWS Provider 1.6
Affected Resource(s)
Please list the resources as a list, for example:
If this issue appears to affect multiple resources, it may be an issue with Terraform's core, so please mention this.
Terraform Configuration Files
Debug Output
Please provider a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.
Expected Behavior
Setup the expected network loadbalancer like in 1.5
Actual Behavior
Steps to Reproduce
Please list the steps required to reproduce the issue, for example:
terraform apply
Important Factoids
N/A
References
N/A
The text was updated successfully, but these errors were encountered: